ISSN 0439-755X
CN 11-1911/B

Acta Psychologica Sinica, 2023, Vol. 55, Issue 7: 1192-1206. doi: 10.3724/SP.J.1041.2023.01192

• Reports of Empirical Studies •

Missing data analysis in cognitive diagnostic models: Random forest threshold imputation method

YOU Xiaofeng1, YANG Jianqin1, QIN Chunying1, LIU Hongyun2,3

  1. School of Mathematics and Information Science, Nanchang Normal University, Nanchang 330022, China;
  2. Beijing Key Laboratory of Applied Experimental Psychology, Beijing Normal University, Beijing 100875, China;
  3. Faculty of Psychology, Beijing Normal University, Beijing 100875, China
  • Received: 2022-04-23 Published: 2023-07-25 Online: 2023-04-21

Abstract: Interest in cognitive diagnostic assessments (CDAs), a new form of testing, has grown sharply in recent years. Because of the way these tests are designed, missing data are inevitable in CDAs, and handling them properly is essential for giving students and teachers accurate diagnostic feedback. As machine learning has spread in education, missing data imputation has advanced accordingly, and research has shown that machine learning techniques have more desirable properties for imputation than traditional approaches. The random forest algorithm has been extended into the random forest imputation (RFI) method for handling missing data in CDAs. Rather than assuming a particular missing data mechanism, RFI takes the characteristics of the data into account. It is a non-parametric method that makes full use of the available response information and the characteristics of response patterns to impute missing data.
Building on the strengths of RFI in classification and prediction and its independence from the type of missing data mechanism, we propose an improved method, random forest threshold imputation (RFTI), for imputing missing responses under the widely used DINA (Deterministic Inputs, Noisy “And” Gate) model. RFTI uses the Response Conformity Index (RCI) to set an imputation threshold, yielding a treatment of missing responses in CDAs that does not rely entirely on imputation. Two simulation studies compared the proposed method with traditional methods. Study 1 first introduced the theoretical background and algorithmic implementation of RFTI, then compared RFTI with RFI on imputation accuracy across proportions of missingness (10%, 20%, 30%, 40%, 50%) and missing data mechanisms (MIXED, MNAR, MAR, MCAR), in order to establish the need for including the RCI during imputation. Study 2 examined the performance of RFTI, RFI, and the EM algorithm in imputing missing data under different conditions, with the same manipulated design factors as in Study 1. RFTI was evaluated on the accuracy of its attribute estimates and item parameter estimates, and was compared against EM and RFI, which have traditionally performed well, to identify the advantages of RFTI and the conditions under which it should be used.
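For readers unfamiliar with the DINA model named above, the following is a minimal sketch of its item response rule and of how simulated responses with missing values might be generated. It is not the simulation code used in the study: the function names, the use of `None` to mark a missing response, and the example parameter values are all assumptions made for illustration, and only the simplest mechanism (MCAR) is shown.

```python
import random

def dina_prob(alpha, q_row, slip, guess):
    """Probability of a correct response under the DINA model.

    alpha : 0/1 attribute mastery pattern of one examinee
    q_row : 0/1 row of the Q-matrix (attributes the item requires)
    slip, guess : item slipping and guessing parameters
    """
    # eta is True only if every attribute required by the item is mastered.
    eta = all(a == 1 for a, q in zip(alpha, q_row) if q == 1)
    return 1.0 - slip if eta else guess

def simulate_responses(alphas, q_matrix, slips, guesses, rng):
    """Draw a 0/1 response matrix (examinees x items)."""
    return [
        [int(rng.random() < dina_prob(alpha, q, s, g))
         for q, s, g in zip(q_matrix, slips, guesses)]
        for alpha in alphas
    ]

def mcar_mask(responses, rate, rng):
    """Set each response to None independently with probability `rate` (MCAR)."""
    return [[None if rng.random() < rate else x for x in row]
            for row in responses]

if __name__ == "__main__":
    rng = random.Random(0)
    q_matrix = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]   # hypothetical Q-matrix
    alphas = [[1, 1, 0], [1, 0, 0], [1, 1, 1]]     # hypothetical examinees
    slips = [0.1, 0.1, 0.2]                        # hypothetical item parameters
    guesses = [0.2, 0.2, 0.1]
    complete = simulate_responses(alphas, q_matrix, slips, guesses, rng)
    observed = mcar_mask(complete, rate=0.3, rng=rng)
    print(observed)
```

MAR and MNAR masks would make the deletion probability depend on observed responses or on the missing value itself; they are omitted here because the abstract does not specify how they were generated.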
Study 1 showed that, compared with RFI, RFTI improved imputation accuracy when the imputation threshold was set to one, and that its imputation rate and accuracy were also better across the design conditions. Study 2 showed that RFTI outperformed the other methods (RFI and the EM algorithm) in estimating the attribute pattern and attribute margin accurately. The size of this advantage depended on the missing data mechanism and the proportion of missing data: RFTI performed notably better than the other methods with mixed-mechanism or MNAR data and when the proportion of missing data exceeded 30%. However, RFTI did not improve on the other methods in the accuracy of item parameter estimates; in most conditions, the EM algorithm gave the most accurate parameter estimates.
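The trade-off between imputation rate and imputation accuracy mentioned in the Study 1 results can be illustrated with a short sketch. Note the substitution: the abstract sets the threshold through the RCI, whose formula is not given here, so this example uses the share of trees voting for a response of 1 as a stand-in confidence score. That stand-in, the function names, and the example numbers are assumptions, not the authors' procedure.

```python
def threshold_impute(vote_shares, threshold):
    """Impute a missing 0/1 response only when confidence reaches `threshold`.

    vote_shares[i] is the (hypothetical) share of trees predicting 1 for
    missing cell i. Cells below the threshold stay missing (None).
    """
    imputed = []
    for p in vote_shares:
        confidence = max(p, 1.0 - p)      # confidence in the majority class
        if confidence >= threshold:
            imputed.append(1 if p >= 0.5 else 0)
        else:
            imputed.append(None)          # leave the response missing
    return imputed

def imputation_rate_and_accuracy(imputed, truth):
    """Return (share of cells imputed, share of imputed cells that are correct)."""
    filled = [(v, t) for v, t in zip(imputed, truth) if v is not None]
    rate = len(filled) / len(imputed) if imputed else 0.0
    accuracy = (sum(v == t for v, t in filled) / len(filled)
                if filled else float("nan"))
    return rate, accuracy

if __name__ == "__main__":
    vote_shares = [1.0, 0.9, 0.55, 0.0, 0.2]   # hypothetical forest output
    truth = [1, 0, 1, 0, 0]                    # hypothetical true responses
    for threshold in (0.5, 1.0):
        imputed = threshold_impute(vote_shares, threshold)
        print(threshold, imputation_rate_and_accuracy(imputed, truth))
```

A stricter threshold imputes fewer cells but is more likely to impute them correctly; the cells left missing are then handled by the measurement model rather than by imputation.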
In sum, we propose a method for imputing missing data in CDAs that applies machine learning within measurement models. Its advantage is supported by its accurate estimation of the attribute pattern and attribute margin under the DINA model. Theoretically, the study provides an imputation approach with fewer assumptions, extending traditional methods for imputing missing data within the CDA framework. We also investigated how to estimate students' attribute patterns accurately from responses to only a few items, which sheds light on imputing data that are missing because of particular designs in assessment or teaching.
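Accuracy at the level of the attribute pattern and of the attribute margin is commonly summarized by two agreement rates between true and estimated mastery patterns. The sketch below shows one plausible way to compute them. The abstract does not define the exact indices used in the study, so the function names and definitions here are assumptions.

```python
def pattern_match_rate(true_patterns, est_patterns):
    """Share of examinees whose whole attribute pattern is recovered exactly."""
    hits = sum(tuple(t) == tuple(e)
               for t, e in zip(true_patterns, est_patterns))
    return hits / len(true_patterns)

def marginal_match_rate(true_patterns, est_patterns):
    """Share of individual attributes classified correctly, over all examinees."""
    total = sum(len(t) for t in true_patterns)
    hits = sum(a == b
               for t, e in zip(true_patterns, est_patterns)
               for a, b in zip(t, e))
    return hits / total

if __name__ == "__main__":
    true_patterns = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]   # hypothetical
    est_patterns = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]    # hypothetical
    print(pattern_match_rate(true_patterns, est_patterns))
    print(marginal_match_rate(true_patterns, est_patterns))
```

The pattern-level rate is the stricter of the two, since a single misclassified attribute counts the whole pattern as wrong.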

Key words: missing data, cognitive diagnostic assessment, random forest threshold imputation, random forest imputation, expectation-maximization algorithm
